authors: Ravi Konjeti, Lakshmi Bhavani and Aleksandar Kamenev
Weather Data for NY state from 2015 to 2020 inclusive, from 5 stations across the state
source: https://www.ncdc.noaa.gov/cdo-web/datasets
- STATION: Station identification code
- NAME: Name of the station (usually city/airport name)
- LATITUDE: (decimated degrees w/northern hemisphere values > 0, southern hemisphere values < 0)
- LONGITUDE: (decimated degrees w/western hemisphere values < 0, eastern hemisphere values > 0)
- ELEVATION: above mean sea level (tenths of meters)
- DATE: Year of the record (4 digits) followed by month (2 digits) and day (2 digits)
- AWND: Average daily wind speed
- PRCP: Precipitation
- SNOW: Snowfall
- SNWD: Snow depth
- TAVG: Average temperature
- TMAX: Maximum temperature
- TMIN: Minimum temperature
- WDF2: Direction of fastest 2-minute wind (degrees)
- WSF2: Fastest 2-minute wind speed
Please see http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt for a complete list of stations and their metadata.
import pandas as pd
import edapy.datainfo as d_info
import edapy.load_data as d_load
import edapy.plot as my_plot
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# load in data
weather_df = d_load.load_df(r'data\dataset.csv')
# instantiate custom package class for eda
data = d_info.DataInfo(weather_df)
The method called below shows the first 5 records in the df so we can get a feel for the data. It also lists the features present and gives information about the feature name, number of non-null values, and data type.
data.describe()
| STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | PRCP | SNOW | SNWD | TAVG | TMAX | TMIN | WDF2 | WSF2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-01-01 | 24.61 | 0.00 | 0.3 | 1.2 | 23.0 | 32.0 | 20.0 | 240 | 40.0 |
| 1 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-01-02 | 13.20 | 0.00 | 0.0 | 1.2 | 31.0 | 33.0 | 22.0 | 250 | 29.1 |
| 2 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-01-03 | 8.50 | 0.58 | 0.0 | 0.0 | 27.0 | 48.0 | 22.0 | 60 | 15.0 |
| 3 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-01-04 | 17.67 | 0.52 | 0.0 | 0.0 | 44.0 | 55.0 | 28.0 | 250 | 38.0 |
| 4 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-01-05 | 19.24 | 0.00 | 0.1 | 0.0 | 20.0 | 28.0 | 9.0 | 270 | 31.1 |
| LATITUDE | LONGITUDE | ELEVATION | AWND | PRCP | SNOW | SNWD | TAVG | TMAX | TMIN | WDF2 | WSF2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10960.000000 | 10960.000000 | 10960.000000 | 10957.000000 | 10959.000000 | 10960.000000 | 10960.000000 | 10593.000000 | 10959.000000 | 10959.000000 | 10960.000000 | 10960.000000 |
| mean | 41.865404 | -75.076166 | 163.080000 | 9.320228 | 0.118639 | 0.167527 | 0.736168 | 51.274615 | 59.714755 | 43.197737 | 220.053832 | 19.880648 |
| std | 0.969652 | 2.071932 | 177.556465 | 4.057914 | 0.292714 | 1.045389 | 2.729762 | 18.206960 | 19.317414 | 17.991548 | 94.062662 | 6.596222 |
| min | 40.639150 | -78.736060 | 2.700000 | 0.450000 | 0.000000 | 0.000000 | 0.000000 | -6.000000 | 1.000000 | -18.000000 | 10.000000 | 4.900000 |
| 25% | 40.793890 | -75.979930 | 25.400000 | 6.490000 | 0.000000 | 0.000000 | 0.000000 | 37.000000 | 44.000000 | 30.000000 | 170.000000 | 15.000000 |
| 50% | 42.206780 | -73.799130 | 85.400000 | 8.720000 | 0.000000 | 0.000000 | 0.000000 | 52.000000 | 61.000000 | 43.000000 | 230.000000 | 19.900000 |
| 75% | 42.747220 | -73.763900 | 216.200000 | 11.410000 | 0.070000 | 0.000000 | 0.000000 | 67.000000 | 77.000000 | 59.000000 | 300.000000 | 23.900000 |
| max | 42.939980 | -73.101810 | 485.700000 | 29.970000 | 4.020000 | 31.200000 | 39.000000 | 89.000000 | 99.000000 | 82.000000 | 360.000000 | 67.100000 |
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10960 entries, 0 to 10959 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 STATION 10960 non-null object 1 NAME 10960 non-null object 2 LATITUDE 10960 non-null float64 3 LONGITUDE 10960 non-null float64 4 ELEVATION 10960 non-null float64 5 DATE 10960 non-null object 6 AWND 10957 non-null float64 7 PRCP 10959 non-null float64 8 SNOW 10960 non-null float64 9 SNWD 10960 non-null float64 10 TAVG 10593 non-null float64 11 TMAX 10959 non-null float64 12 TMIN 10959 non-null float64 13 WDF2 10960 non-null int64 14 WSF2 10960 non-null float64 dtypes: float64(11), int64(1), object(3) memory usage: 1.3+ MB
None
From the above information, some key details about the data set present can be observed.
For the most part we have numeric values but there are some object time values.
STATION & NAME are strong candidates for conversion to String and also can potentially be seen as candidates for categorical encoding (one-hot encoding). However, since we will only be using these values for identification of the stations, the splitting of the data and no predictions will be made on what station the data point is from as this is not of interest to this research we will not focus on these two features beyond use for splitting and differentiation between station to perform comparisons on other features.
DATE is a good candidate for conversion to date-time format and base of comparison when doing time series analysis.
# to date-time format
data.to_date_time("DATE")
# While we can convert to string, two questions arise. Is it needed? Is one hot encoding not more beneficial.
# data.as_string("STATION", "NAME")
The method called below gives information in detail about the number of missing and present values per feature in the dataset as well as provide details about the possible values a feature may have and the number of times they occur. More on this will follow later on in the notebook when we visualize data per feature looking for the outlier.
data.nulls()
Show number of missing instances per feature --------------------
STATION 0 NAME 0 LATITUDE 0 LONGITUDE 0 ELEVATION 0 DATE 0 AWND 3 PRCP 1 SNOW 0 SNWD 0 TAVG 367 TMAX 1 TMIN 1 WDF2 0 WSF2 0 dtype: int64
Show number of present instances per feature --------------------
STATION 10960 NAME 10960 LATITUDE 10960 LONGITUDE 10960 ELEVATION 10960 DATE 10960 AWND 10957 PRCP 10959 SNOW 10960 SNWD 10960 TAVG 10593 TMAX 10959 TMIN 10959 WDF2 10960 WSF2 10960 dtype: int64
Shape (10960, 15)
** ** ** ** ** ** ** ** ** **
Feature: STATION
Number of unique values: 5
['USW00014733' 'USW00014735' 'USW00004725' 'USW00004781' 'USW00094789']
_ _ _ _ _ _ _ _ _ _ _ _ _ _
USW00014733 2192
USW00014735 2192
USW00004725 2192
USW00004781 2192
USW00094789 2192
Name: STATION, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: NAME
Number of unique values: 5
['BUFFALO NIAGARA INTERNATIONAL, NY US'
'ALBANY INTERNATIONAL AIRPORT, NY US' 'BINGHAMTON GREATER AP, NY US'
'ISLIP LI MACARTHUR AIRPORT, NY US' 'JFK INTERNATIONAL AIRPORT, NY US']
_ _ _ _ _ _ _ _ _ _ _ _ _ _
BUFFALO NIAGARA INTERNATIONAL, NY US 2192
ALBANY INTERNATIONAL AIRPORT, NY US 2192
BINGHAMTON GREATER AP, NY US 2192
ISLIP LI MACARTHUR AIRPORT, NY US 2192
JFK INTERNATIONAL AIRPORT, NY US 2192
Name: NAME, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: LATITUDE
Number of unique values: 5
[42.93998 42.74722 42.20678 40.79389 40.63915]
_ _ _ _ _ _ _ _ _ _ _ _ _ _
42.93998 2192
42.74722 2192
42.20678 2192
40.79389 2192
40.63915 2192
Name: LATITUDE, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: LONGITUDE
Number of unique values: 5
[-78.73606 -73.79913 -75.97993 -73.10181 -73.7639 ]
_ _ _ _ _ _ _ _ _ _ _ _ _ _
-78.73606 2192
-73.79913 2192
-75.97993 2192
-73.10181 2192
-73.76390 2192
Name: LONGITUDE, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: ELEVATION
Number of unique values: 5
[216.2 85.4 485.7 25.4 2.7]
_ _ _ _ _ _ _ _ _ _ _ _ _ _
216.2 2192
85.4 2192
485.7 2192
25.4 2192
2.7 2192
Name: ELEVATION, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: DATE
Number of unique values: 2192
['2015-01-01T00:00:00.000000000' '2015-01-02T00:00:00.000000000'
'2015-01-03T00:00:00.000000000' ... '2020-12-29T00:00:00.000000000'
'2020-12-30T00:00:00.000000000' '2020-12-31T00:00:00.000000000']
_ _ _ _ _ _ _ _ _ _ _ _ _ _
2015-01-01 5
2019-01-04 5
2018-12-29 5
2018-12-30 5
2018-12-31 5
..
2016-12-25 5
2016-12-24 5
2016-12-23 5
2016-12-22 5
2020-12-31 5
Name: DATE, Length: 2192, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: AWND
Number of unique values: 123
[24.61 13.2 8.5 17.67 19.24 10.74 16.55 19.46 18.34 11.86 10.96 8.05
7.83 16.33 13.65 9.84 17.22 12.3 4.92 4.7 14.09 8.95 13.42 6.93
10.51 9.17 14.54 11.18 6.26 15.66 3.8 12.97 10.29 8.28 10.07 12.53
5.82 9.62 15.21 14.76 5.59 11.41 6.04 14.32 11.63 6.71 13.87 16.11
4.03 7.16 17. 4.47 21.47 7.61 18.12 19.01 9.4 8.72 5.37 3.58
7.38 12.08 5.14 6.49 12.75 3.13 4.25 23.71 19.69 21.92 17.9 20.8
3.36 18.79 15.88 14.99 15.43 2.91 17.45 19.91 20.36 21.03 20.13 16.78
2.68 22.37 22.15 25.72 21.25 2.46 18.57 29.3 26.4 25.28 25.95 1.79
22.82 23.04 2.01 1.57 1.12 0.89 2.24 1.34 0.45 20.58 0.67 nan
24.83 23.49 21.7 23.94 27.74 25.05 23.26 24.16 25.5 28.19 29.97 28.41
22.59 27.51 24.38]
_ _ _ _ _ _ _ _ _ _ _ _ _ _
8.50 372
6.71 360
5.82 346
8.05 320
7.61 319
...
28.19 1
25.50 1
27.74 1
0.45 1
24.38 1
Name: AWND, Length: 122, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: PRCP
Number of unique values: 210
[0. 0.58 0.52 0.02 0.03 0.12 0.4 0.08 0.34 0.04 0.1 0.26 0.01 0.23
0.42 0.35 0.05 0.07 0.3 0.14 0.22 0.09 0.17 0.31 0.39 0.19 0.28 0.37
0.41 0.2 0.27 0.13 0.49 2.44 0.53 0.8 0.57 0.25 0.11 0.46 1.72 0.06
0.43 0.38 0.65 0.16 0.44 2.85 0.81 1.85 1. 0.36 0.76 0.18 1.64 0.15
0.33 0.73 0.51 0.24 0.66 0.89 0.94 0.29 0.32 0.71 1.04 0.72 1.54 1.76
1.1 1.3 0.54 0.83 0.21 0.86 0.69 0.6 0.64 1.58 1.95 1.07 1.29 1.37
1.79 0.77 2.29 1.27 1.25 0.67 0.82 0.63 1.81 0.91 2.01 0.5 0.59 1.33
0.45 0.87 0.88 1.32 1.4 1.15 0.47 1.46 0.55 0.48 0.68 0.9 1.12 1.22
0.74 2.14 0.92 1.18 1.43 1.13 0.78 0.95 0.56 0.62 1.03 1.21 0.61 0.79
1.48 1.62 1.84 1.57 1.11 2.74 0.7 1.2 2. 2.55 1.08 1.55 0.75 0.96
1.06 1.24 1.05 0.97 2.26 1.16 nan 1.28 1.59 0.98 1.6 1.63 0.99 0.84
3.92 1.02 1.38 1.73 1.91 1.23 1.01 1.17 1.65 1.36 0.85 2.18 1.09 1.35
1.51 1.34 1.49 1.47 2.7 1.66 1.45 1.19 1.39 0.93 1.8 1.53 2.02 1.5
1.69 1.44 1.41 2.3 2.13 1.68 1.92 1.87 4.02 3.11 1.7 1.52 1.26 1.61
2.05 2.75 2.48 2.9 2.47 1.42 1.94 2.07 1.78 1.14 2.84 2.16 1.71 2.33]
_ _ _ _ _ _ _ _ _ _ _ _ _ _
0.00 6572
0.01 488
0.02 328
0.03 227
0.04 183
...
2.02 1
1.44 1
1.41 1
2.30 1
2.33 1
Name: PRCP, Length: 209, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: SNOW
Number of unique values: 102
[ 0.3 0. 0.1 1. 0.8 3.8 6.5 2.9 3.6 0.2 1.6 0.9 1.8 0.6
4.4 8.2 0.7 5.4 2.3 1.1 2.4 2.7 1.4 0.5 6.9 2.1 0.4 1.5
1.9 1.2 2.2 1.3 7.4 3.3 5.8 4.9 8.9 2.8 3.9 9.6 6.3 3.4
5.6 3.2 13.2 1.7 2.6 4. 5. 2. 3.1 2.5 3.7 3.5 8.6 5.7
6.2 17.2 6.4 13.6 4.7 8.7 4.5 4.1 18.4 11.9 3. 6.1 4.6 4.3
11.2 7.6 17. 5.5 8.4 11.5 10.4 4.8 13.3 6.8 19.7 8.8 10.2 10.5
31.2 12.5 8. 7.2 5.1 26.4 7.5 17.4 5.3 23.4 9.8 9.7 14.3 16.
14.9 4.2 30.3 8.3]
_ _ _ _ _ _ _ _ _ _ _ _ _ _
0.0 9903
0.1 178
0.2 96
0.3 73
0.6 49
...
10.4 1
13.3 1
6.8 1
4.7 1
8.3 1
Name: SNOW, Length: 102, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: SNWD
Number of unique values: 31
[ 1.2 0. 2. 9.8 7.1 9.1 7.9 5.1 3.1 3.9 5.9 13. 16.1 18.1
20.9 22. 22.8 26. 25.2 24. 20.1 14.2 11.8 11. 15. 18.9 16.9 39.
31.1 29.1 28. ]
_ _ _ _ _ _ _ _ _ _ _ _ _ _
0.0 9514
1.2 328
2.0 225
3.1 170
3.9 116
5.1 111
5.9 75
7.9 67
7.1 64
9.1 45
9.8 31
11.8 27
13.0 25
11.0 24
22.0 17
16.1 16
14.2 14
18.1 14
15.0 14
20.9 12
20.1 11
16.9 11
18.9 10
22.8 6
26.0 4
25.2 3
24.0 2
39.0 1
31.1 1
29.1 1
28.0 1
Name: SNWD, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: TAVG
Number of unique values: 97
[23. 31. 27. 44. 20. 12. 11. 8. 16. 21. 30. 10. 7. 18. 25. 15. 39. 24.
26. 13. 14. 17. 2. -1. -5. 6. -4. 19. 5. 3. 22. 33. 9. 32. 36. 35.
42. 37. 28. 38. 34. 49. 47. 40. 53. 45. 60. 51. 52. 57. 43. 54. 55. 58.
65. 62. 66. 72. 76. 74. 73. 61. 59. 64. 68. 63. 70. 71. 67. 77. 48. 56.
69. 75. 79. 78. 50. 46. 29. nan 41. 81. 80. 82. 0. 4. 1. 85. 88. 84.
83. -2. -3. -6. 87. 86. 89.]
_ _ _ _ _ _ _ _ _ _ _ _ _ _
71.0 272
69.0 258
72.0 244
67.0 229
74.0 226
...
88.0 2
-5.0 2
-3.0 1
-6.0 1
89.0 1
Name: TAVG, Length: 96, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: TMAX
Number of unique values: 99
[32. 33. 48. 55. 28. 18. 17. 15. 19. 31. 21. 29. 30. 41. 44. 23. 22. 38.
27. 20. 13. 2. 5. 10. 12. 1. 24. 16. 26. 34. 25. 47. 51. 37. 45. 35.
42. 39. 46. 67. 52. 40. 36. 58. 50. 65. 66. 61. 78. 60. 71. 57. 70. 69.
68. 79. 76. 80. 88. 89. 81. 82. 53. 83. 63. 75. 85. 62. 54. 73. 84. 77.
74. 72. 64. 87. 86. 91. 59. 56. 49. 43. 11. 90. 92. 14. 9. 4. 93. 8.
7. 94. 98. 97. 95. 96. 6. nan 99.]
_ _ _ _ _ _ _ _ _ _ _ _ _ _
81.0 242
80.0 238
79.0 236
83.0 235
82.0 220
...
1.0 2
7.0 2
98.0 2
5.0 2
2.0 1
Name: TMAX, Length: 98, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: TMIN
Number of unique values: 96
[ 20. 22. 28. 9. 10. 2. 5. 14. 15. -4. 7. 1. 33. 12.
23. 24. 13. 8. 4. 3. 0. 6. -5. -10. -6. -8. -7. -1.
25. 19. 27. 29. 31. 35. 30. 16. 26. 18. 17. 32. 37. 39.
34. 46. 38. 40. 43. 44. 42. 47. 51. 54. 53. 59. 64. 67.
49. 41. 50. 60. 65. 62. 63. 58. 45. 56. 61. 57. 69. 55.
52. 71. 72. 66. 68. 70. 48. 36. 21. 11. -12. 74. 75. 76.
-3. 73. -2. -9. -13. 77. -18. nan 78. 79. 80. 82.]
_ _ _ _ _ _ _ _ _ _ _ _ _ _
32.0 243
30.0 235
63.0 233
34.0 229
60.0 227
...
-12.0 2
80.0 2
-13.0 1
-18.0 1
82.0 1
Name: TMIN, Length: 95, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: WDF2
Number of unique values: 36
[240 250 60 270 280 220 260 230 350 360 190 200 290 30 310 210 330 40
50 100 300 180 340 80 90 320 70 160 20 130 140 170 120 110 150 10]
_ _ _ _ _ _ _ _ _ _ _ _ _ _
280 519
290 519
240 492
300 485
180 466
170 449
190 446
220 442
330 436
310 429
230 421
320 412
340 387
250 370
270 368
210 358
350 354
200 333
260 331
160 325
360 278
30 243
20 242
150 241
40 225
140 189
60 158
10 157
50 148
130 129
70 123
100 122
120 110
110 108
90 94
80 51
Name: WDF2, dtype: int64
___________________________
** ** ** ** ** ** ** ** ** **
Feature: WSF2
Number of unique values: 45
[40. 29.1 15. 38. 31.1 21.9 30. 35.1 21. 25.1 16.1 18.1 19.9 10.1
8.9 23. 23.9 13. 25.9 28. 14.1 17. 36. 12.1 33.1 42.9 8.1 32.
40.9 36.9 38.9 46.1 45. 47. 49. 6.9 44.1 52.1 48.1 6. 67.1 4.9
11. 51. 53.9]
_ _ _ _ _ _ _ _ _ _ _ _ _ _
16.1 815
17.0 777
18.1 775
21.0 764
15.0 758
19.9 718
14.1 672
21.9 670
13.0 586
23.0 546
23.9 510
12.1 456
25.1 386
25.9 361
10.1 299
28.0 295
29.1 251
30.0 217
8.9 177
31.1 169
32.0 151
8.1 99
33.1 90
35.1 88
36.0 75
36.9 47
38.0 42
6.9 38
38.9 26
40.0 26
40.9 21
42.9 14
6.0 13
44.1 9
49.0 5
47.0 4
45.0 2
46.1 1
52.1 1
48.1 1
67.1 1
4.9 1
11.0 1
51.0 1
53.9 1
Name: WSF2, dtype: int64
___________________________
None
The method called below shows the instances where there are missing records separated based on feature which has missing values.
list_names = data.get_column_names_where_missing_data()
print(list_names)
data.show_missing(*list_names)
Index(['AWND', 'PRCP', 'TAVG', 'TMAX', 'TMIN'], dtype='object') Feature name: AWND
| STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | PRCP | SNOW | SNWD | TAVG | TMAX | TMIN | WDF2 | WSF2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6049 | USW00004725 | BINGHAMTON GREATER AP, NY US | 42.20678 | -75.97993 | 485.7 | 2019-07-24 | NaN | 0.00 | 0.0 | 0.0 | 69.0 | 75.0 | 55.0 | 360 | 13.0 |
| 7349 | USW00004781 | ISLIP LI MACARTHUR AIRPORT, NY US | 40.79389 | -73.10181 | 25.4 | 2017-02-12 | NaN | 0.04 | 0.4 | 9.8 | 34.0 | 40.0 | 30.0 | 10 | 6.9 |
| 8170 | USW00004781 | ISLIP LI MACARTHUR AIRPORT, NY US | 40.79389 | -73.10181 | 25.4 | 2019-05-14 | NaN | 0.07 | 0.0 | 0.0 | 47.0 | 52.0 | 44.0 | 30 | 10.1 |
Feature name: PRCP
| STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | PRCP | SNOW | SNWD | TAVG | TMAX | TMIN | WDF2 | WSF2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3734 | USW00014735 | ALBANY INTERNATIONAL AIRPORT, NY US | 42.74722 | -73.79913 | 85.4 | 2019-03-23 | 21.03 | NaN | 0.4 | 0.0 | 32.0 | 37.0 | 28.0 | 300 | 36.0 |
Feature name: TAVG
| STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | PRCP | SNOW | SNWD | TAVG | TMAX | TMIN | WDF2 | WSF2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 334 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-12-01 | 7.61 | 0.10 | 0.0 | 0.0 | NaN | 53.0 | 39.0 | 230 | 18.1 |
| 335 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-12-02 | 6.71 | 0.02 | 0.0 | 0.0 | NaN | 49.0 | 34.0 | 240 | 17.0 |
| 336 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-12-03 | 11.41 | 0.01 | 0.0 | 0.0 | NaN | 44.0 | 33.0 | 280 | 21.0 |
| 337 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-12-04 | 11.63 | 0.00 | 0.0 | 0.0 | NaN | 44.0 | 38.0 | 240 | 28.0 |
| 338 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-12-05 | 4.92 | 0.00 | 0.0 | 0.0 | NaN | 45.0 | 31.0 | 230 | 13.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2095 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2020-09-26 | 8.50 | 0.00 | 0.0 | 0.0 | NaN | 80.0 | 57.0 | 210 | 16.1 |
| 2096 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2020-09-27 | 12.75 | 0.00 | 0.0 | 0.0 | NaN | 82.0 | 65.0 | 210 | 25.9 |
| 2097 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2020-09-28 | 10.51 | 0.01 | 0.0 | 0.0 | NaN | 80.0 | 62.0 | 220 | 31.1 |
| 2098 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2020-09-29 | 8.50 | 0.41 | 0.0 | 0.0 | NaN | 67.0 | 54.0 | 250 | 25.1 |
| 2099 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2020-09-30 | 14.32 | 1.72 | 0.0 | 0.0 | NaN | 62.0 | 51.0 | 210 | 33.1 |
367 rows × 15 columns
Feature name: TMAX
| STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | PRCP | SNOW | SNWD | TAVG | TMAX | TMIN | WDF2 | WSF2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5584 | USW00004725 | BINGHAMTON GREATER AP, NY US | 42.20678 | -75.97993 | 485.7 | 2018-04-15 | 17.9 | 0.0 | 0.0 | 0.0 | 38.0 | NaN | NaN | 130 | 29.1 |
Feature name: TMIN
| STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | PRCP | SNOW | SNWD | TAVG | TMAX | TMIN | WDF2 | WSF2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5584 | USW00004725 | BINGHAMTON GREATER AP, NY US | 42.20678 | -75.97993 | 485.7 | 2018-04-15 | 17.9 | 0.0 | 0.0 | 0.0 | 38.0 | NaN | NaN | 130 | 29.1 |
The method call below provides visualizations of the missing data in the dataset. This may be helpful to detect and show a correlation between missing values.
data.missing_bar_plot()
<AxesSubplot:>
<AxesSubplot:>
<AxesSubplot:>
From the above information we get a good idea from what data is missing and how to handle it.
As we are working with data points over a period of time and, each instance represents data for a particular date, it would be ideal not to lose/drop any rows. Therefore we will explore other options. There are quite a few build-in solutions to this problem. Back and forward fill and interpolate are a few such possibilities. However, we will use a custom approach to solve the issue of filling in missing data.
For features: AWND, PRCP, TMAX, and TMIN, the fact that there are data points from several geographically close sources for the same date, will be leveraged.
To fill in missing values in the TAVG column, we will leverage the fact that each instance has data for min and max temperature on the given date. Adding TMAX and TMIN and dividing the result in two will give a close enough approximation to the actual value. It will serve as a theoretical approximation with high probability.
From the above heatmap on the missing data it is clear that only in the case of TMIN and TMAX when one is missing so is the other.
Handle missing values in AWND, PRCP, TMAX, and TMIN
# get list of missing with out TAVG
list_names = list_names.drop('TAVG')
list_names
Index(['AWND', 'PRCP', 'TMAX', 'TMIN'], dtype='object')
The below method call replaces missing values for the features in the list with the average value for all instances of that date.
Sumation of n(1~i) / i
data.average_fill_in_missing(list_names)
The below call replaces missing value with average of two other features.
TAVG = (TMAX + TMIN) / 2
data.half_sum_fill_in_missing('TAVG')
Check to confirm data adjustments
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10960 entries, 0 to 10959 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 STATION 10960 non-null object 1 NAME 10960 non-null object 2 LATITUDE 10960 non-null float64 3 LONGITUDE 10960 non-null float64 4 ELEVATION 10960 non-null float64 5 DATE 10960 non-null datetime64[ns] 6 AWND 10960 non-null float64 7 PRCP 10960 non-null float64 8 SNOW 10960 non-null float64 9 SNWD 10960 non-null float64 10 TAVG 10960 non-null float64 11 TMAX 10960 non-null float64 12 TMIN 10960 non-null float64 13 WDF2 10960 non-null int64 14 WSF2 10960 non-null float64 dtypes: datetime64[ns](1), float64(11), int64(1), object(2) memory usage: 1.3+ MB
The below method call plots the histogram of each feature in the dataset. From this information, we can start to get an idea of how the data is distributed and so gain deeper insight.
my_plot.plot_hist(data.get_df())
The below boxplot show how the data is distributed over the possible range of values per feature and gives insight into outlier values.
my_plot.plot_outliers(data.get_df())
my_plot.plot_outlier_single_col_for_all(data.get_df())
<Figure size 432x288 with 0 Axes>
For the next part some more feature engineering will be required.
# extract and add features YEAR, MONTH and DAY from DATE
data.feature_extraction_date()
# extract ad add feature DAYS_SINCE_JAN_1
data.feature_extraction_days_since_jan_1()
# confurm operation
data.get_df()
| STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DATE | AWND | PRCP | SNOW | SNWD | TAVG | TMAX | TMIN | WDF2 | WSF2 | YEAR | MONTH | DAY | DAYS_SINCE_JAN_1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-01-01 | 24.61 | 0.00 | 0.3 | 1.2 | 23.0 | 32.0 | 20.0 | 240 | 40.0 | 2015 | 1 | 1 | 0 |
| 1 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-01-02 | 13.20 | 0.00 | 0.0 | 1.2 | 31.0 | 33.0 | 22.0 | 250 | 29.1 | 2015 | 1 | 2 | 1 |
| 2 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-01-03 | 8.50 | 0.58 | 0.0 | 0.0 | 27.0 | 48.0 | 22.0 | 60 | 15.0 | 2015 | 1 | 3 | 2 |
| 3 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-01-04 | 17.67 | 0.52 | 0.0 | 0.0 | 44.0 | 55.0 | 28.0 | 250 | 38.0 | 2015 | 1 | 4 | 3 |
| 4 | USW00014733 | BUFFALO NIAGARA INTERNATIONAL, NY US | 42.93998 | -78.73606 | 216.2 | 2015-01-05 | 19.24 | 0.00 | 0.1 | 0.0 | 20.0 | 28.0 | 9.0 | 270 | 31.1 | 2015 | 1 | 5 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10955 | USW00094789 | JFK INTERNATIONAL AIRPORT, NY US | 40.63915 | -73.76390 | 2.7 | 2020-12-27 | 7.61 | 0.00 | 0.0 | 0.0 | 29.0 | 38.0 | 24.0 | 280 | 19.9 | 2020 | 12 | 27 | 361 |
| 10956 | USW00094789 | JFK INTERNATIONAL AIRPORT, NY US | 40.63915 | -73.76390 | 2.7 | 2020-12-28 | 10.96 | 0.00 | 0.0 | 0.0 | 41.0 | 51.0 | 33.0 | 270 | 19.9 | 2020 | 12 | 28 | 362 |
| 10957 | USW00094789 | JFK INTERNATIONAL AIRPORT, NY US | 40.63915 | -73.76390 | 2.7 | 2020-12-29 | 16.11 | 0.00 | 0.0 | 0.0 | 40.0 | 43.0 | 28.0 | 330 | 29.1 | 2020 | 12 | 29 | 363 |
| 10958 | USW00094789 | JFK INTERNATIONAL AIRPORT, NY US | 40.63915 | -73.76390 | 2.7 | 2020-12-30 | 9.62 | 0.00 | 0.0 | 0.0 | 33.0 | 43.0 | 24.0 | 210 | 19.9 | 2020 | 12 | 30 | 364 |
| 10959 | USW00094789 | JFK INTERNATIONAL AIRPORT, NY US | 40.63915 | -73.76390 | 2.7 | 2020-12-31 | 11.18 | 0.36 | 0.0 | 0.0 | 44.0 | 48.0 | 35.0 | 330 | 21.0 | 2020 | 12 | 31 | 365 |
10960 rows × 19 columns
Motiovation: Splitting the dataset based on station may be beneficial for some comparisons and nonrepetitive code.
# get unique values in STATION feature
list_of_stations = data.get_unique('STATION')
# make list of df split based on STATION name
df_list_station = data.get_list_dfs(list_of_stations, 'STATION')
Motiovation: Splitting the datasets of stations based on year may be beneficial for some comparisons and nonrepetitive code.
# gets a list of list. Each list on i level contains df split on station
# and each list on j level contains df split on year for that station
df_i_station_j_years = data.get_list_of_lists_df_split(df_list_station, 'YEAR')
# spearman corr for all feature ecept name, station and date
my_plot.plot_heat_map_per_feature(data.get_df(), 'spearman')
<Figure size 432x288 with 0 Axes>
my_plot.heat_map(data.get_df())
The above heat maps show strong correlations between certain key features. While there are apparent relations, there are some which are very interesting.
Strong corelations:
The below plots show minimum and maximum temperature over the period split by STATION.
for df_section in df_list_station:
my_plot.plot_temp_high_low(df_section)
The bellow plot shows minimum and maximum temperature and their relation. Color-coding based on data gatheriing station.
my_plot.join_plot_hue(data.get_df(), 'TMIN', 'TMAX', 'NAME')
The bellow plot is for TAVG over time period utalizing a more generic alternative custom plot method as opposed to the one used to plot the plots higher up in this notebook fot TMIN and TMAX over time
feature_list = ['TAVG']
for df_section in df_list_station:
my_plot.plot_many(df_section, 'DATE', 'NAME', feature_list, 'Temperature in degree F°')
my_plot.join_plot_hue(data.get_df(), 'DATE', 'TAVG', 'NAME')
The plot below shows precipitation over a year's period of time for each year in the df. One plot per data gathering station.
my_plot.plot_list_of_list(df_i_station_j_years, 'DAYS_SINCE_JAN_1', 'TAVG', 'YEAR', 'NAME',
'Passed days since Jan 1st', 'Average temperature (F°)')
Simular to the plots above but with cumulative sum over the period
my_plot.plot_list_of_list(df_i_station_j_years, 'DAYS_SINCE_JAN_1', 'TAVG', 'YEAR', 'NAME',
'Passed days since Jan 1st', 'Average temperature (F°)', 'cumsum')
From the above plots, an overall trend is observed as time progresses the difference between min and max temperature decreases, and both those values increase. Therefore, a trend of increase in temperature is present.
feature_list = ['PRCP']
for df_section in df_list_station:
my_plot.plot_many(df_section, 'DATE', 'NAME', feature_list, 'Precipitation in inches')
my_plot.join_plot_hue(data.get_df(), 'DATE', 'PRCP', 'NAME')
The plot below shows precipitation over a year's period for each year in the df. One plot per data gathering station
# plot precipitation over a year period for each year in the df.
# One plot per data gathering station
my_plot.plot_list_of_list(df_i_station_j_years, 'DAYS_SINCE_JAN_1', 'PRCP', 'YEAR', 'NAME',
'Passed days since Jan 1st', 'Precipitation (inch)')
Simular to the plots above but with cumulative sum over the period
my_plot.plot_list_of_list(df_i_station_j_years, 'DAYS_SINCE_JAN_1', 'PRCP', 'YEAR', 'NAME',
'Passed days since Jan 1st', 'Precipitation (inch)', 'cumsum')
only_first_last_years = [inner[::len(inner)-1] for inner in df_i_station_j_years]
my_plot.plot_list_of_list(only_first_last_years, 'DAYS_SINCE_JAN_1', 'PRCP', 'YEAR', 'NAME',
'Passed days since Jan 1st', 'Precipitation (inch)')
my_plot.plot_list_of_list(only_first_last_years, 'DAYS_SINCE_JAN_1', 'PRCP', 'YEAR', 'NAME',
'Passed days since Jan 1st', 'Precipitation (inch)', 'cumsum')
Note: Here we can make observations about how windy a particular part of the state is.
Fastest 2-minute wind speed in relation to Average daily wind speed. Color-coding based on data gathering station.
my_plot.join_plot_hue(data.get_df(), 'AWND', 'WSF2', 'NAME')
import json
import plotly.express as px
# get geo data from file
ny_state = json.load(open("data/geo/cb_2018_36_place_500k.geojson", 'r'))
# dict for the key id pairs
state_id_map = {}
# extract id
for feature in ny_state['features']:
feature['id'] = feature['properties']['GEOID']
state_id_map[feature['properties']['NAME']] = feature['id']
Experimental work
By hand encoding for the GEo map with plotly
# get county names
data.get_df()['COUNTY'] = data.get_df()['NAME'].apply(lambda x: x.split(" ")[0])
# clean up county names and match to ones from the geo data file
data.get_df()['COUNTY'] = data.get_df()['COUNTY'].apply(lambda x: 'Niagara Falls' if x == 'BUFFALO'
else 'New York' if x == 'JFK' else x.capitalize())
# get ids baased on dict
data.get_df()['id'] = data.get_df()['COUNTY'].apply(lambda x: state_id_map[x])
# add a sum of all TAVG values divided by total number of occurences per COUNTY unique value
list_of_counties = data.get_df()['COUNTY'].unique()
for county in list_of_counties:
tavg = data.get_df().loc[data.get_df()['COUNTY'] == county, 'TAVG'].sum()
number_of_records = data.get_df().loc[data.get_df()['COUNTY'] == 'New York', 'TAVG'].count()
sum_over_count = tavg/number_of_records
data.get_df().loc[data.get_df()['COUNTY'] == county, 'SUM_TAVG'] = sum_over_count
fig = px.choropleth(data.get_df(),
locations='id',
geojson=ny_state,
color='SUM_TAVG',
scope='usa',
hover_name='NAME',
hover_data=['SUM_TAVG'])
fig.update_geos(fitbounds="locations", visible=True)
fig.show()
center_lon = data.get_df().loc[data.get_df()['COUNTY'] == 'Binghamton', 'LONGITUDE'].tolist()[0]
center_lat = data.get_df().loc[data.get_df()['COUNTY'] == 'Binghamton', 'LATITUDE'].tolist()[0]
fig = px.choropleth_mapbox(data.get_df(),
locations='id',
geojson=ny_state,
color='SUM_TAVG',
hover_name='NAME',
hover_data=['SUM_TAVG'],
mapbox_style='carto-positron',
zoom = 5.8,
center={'lat': center_lat, 'lon': center_lon})
fig.show()